Some Unsupervised Learning Models
Unsupervised learning is the study of data without labels; typically, it is the task of grouping, explaining, and finding structure in data.
Clustering
Clustering is the task of organizing data into groups (clusters); the K-means algorithm is a common way to do it. When multiple modes show up in the data, we say the data is multimodally distributed, and the process of grouping those modes into clusters without labels is clustering.
The K-means algorithm partitions the data into $K$ distinct, non-overlapping clusters. Let $C_1, \dots, C_K$ be the clusters, whose union is the whole data set: $C_1 \cup C_2 \cup \cdots \cup C_K = \{1, \dots, n\}$. Since the clusters are non-overlapping, $C_i \cap C_j = \emptyset$ for all $i \neq j$. The key idea is that a good clustering keeps the within-cluster variation as small as possible, where the within-cluster variation $W(C_k)$ is a measure of how much the observations within cluster $C_k$ differ from one another. We want to find $C_1, \dots, C_K$ to minimize $\sum_{k=1}^{K} W(C_k)$.
- It is common to use squared Euclidean distance as the measure of difference, $W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2$, where $|C_k|$ is the number of observations in cluster $C_k$.
- Let $\bar{x}_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i$ be the centroid of cluster $C_k$; then $W(C_k) = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2$ (a short derivation follows this list).
- There are approximately $K^n$ ways to partition $n$ observations into $K$ clusters, so exhaustive search is computationally prohibitive.
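The centroid identity in the list above can be verified directly by expanding each pairwise difference around the cluster mean:

$$
\begin{aligned}
\frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2
&= \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} \big( (x_{ij} - \bar{x}_{kj}) - (x_{i'j} - \bar{x}_{kj}) \big)^2 \\
&= 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2 ,
\end{aligned}
$$

because the cross term vanishes: $\sum_{i \in C_k} (x_{ij} - \bar{x}_{kj}) = 0$ for every feature $j$. This is why minimizing within-cluster variation is equivalent to minimizing squared distances to the centroids, which is exactly what the two alternating steps below do.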
Steps:
- Initialization: randomly initialize the $K$ cluster centers (e.g., by picking $K$ of the observations at random).
- Then iteratively alternate between two steps until the assignments stop changing (a minimal implementation sketch follows this list):
  - Assignment step: assign each observation to the closest cluster center.
    - This step can only decrease the total within-cluster variation.
  - Re-center step: move each cluster center to the mean of the data assigned to it.
    - This step can also only decrease the total within-cluster variation.
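As referenced above, here is a minimal sketch of these alternating steps (Lloyd's algorithm) in NumPy; the function name `kmeans` and the pick-$K$-observations initialization are illustrative choices, not prescribed by the notes:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means (Lloyd's algorithm).

    X: (n, p) data matrix; K: number of clusters.
    Returns (assignments of shape (n,), centers of shape (K, p)).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Initialization: pick K distinct observations as the initial centers.
    centers = X[rng.choice(n, size=K, replace=False)].astype(float)
    assign = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (n, K)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # assignments stopped changing
        assign = new_assign
        # Re-center step: move each center to the mean of its assigned points.
        for k in range(K):
            pts = X[assign == k]
            if len(pts):
                centers[k] = pts.mean(axis=0)
    return assign, centers

# Usage on two well-separated synthetic blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(3.0, 0.3, size=(50, 2))])
labels, centers = kmeans(X, K=2)
```

Each iteration can only decrease the objective $\sum_k W(C_k)$, so the algorithm always converges, though possibly to a local optimum; running it from several random initializations is a common remedy.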
We may add some extensions to the K-means algorithm:
- Non-exhaustive clustering: allow some of the data points not to belong to any cluster.
- Overlapping clustering: allow some data points to belong to more than one cluster (a.k.a. soft clustering, e.g., soft K-means); a sketch of a soft variant follows this list.
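One common way to realize overlapping clustering is soft K-means, where each point gets fractional responsibilities over all clusters instead of a single assignment. A minimal sketch, assuming exponentially decaying responsibilities with a stiffness parameter `beta`; the function name `soft_kmeans` and that parameterization are illustrative assumptions, not from the notes:

```python
import numpy as np

def soft_kmeans(X, K, beta=2.0, n_iter=50, seed=0):
    """Soft K-means: points belong fractionally to every cluster.

    Returns (responsibilities of shape (n, K), centers of shape (K, p)).
    Larger beta -> harder, more K-means-like assignments.
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    centers = X[rng.choice(n, size=K, replace=False)].astype(float)
    for _ in range(n_iter):
        # Responsibilities proportional to exp(-beta * squared distance).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (n, K)
        logits = -beta * d2
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        r = np.exp(logits)
        r /= r.sum(axis=1, keepdims=True)
        # Centers become responsibility-weighted means of all points.
        centers = (r.T @ X) / r.sum(axis=0)[:, None]
    return r, centers
```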
Principal Component Analysis
Principal Component Analysis (PCA) is used for dimensionality reduction: it maps data to a lower-dimensional space. The idea of PCA is to find linear low-dimensional representations of the data that preserve as much variation as possible.
We define the $m$-th principal component (PC) as $Z_m = \phi_{1m} X_1 + \phi_{2m} X_2 + \cdots + \phi_{pm} X_p$, where $\phi_m = (\phi_{1m}, \dots, \phi_{pm})^\top$, and we have the constraints:
- $Z_m$ has the largest variance among all directions satisfying the constraints below.
- $\phi_m$ should be normalized, $\sum_{j=1}^{p} \phi_{jm}^2 = 1$, and we call it the loading vector. We select $\phi_m$ by maximizing the sample variance $\frac{1}{n} \sum_{i=1}^{n} \big( \sum_{j=1}^{p} \phi_{jm} x_{ij} \big)^2$, or in matrix form $\max_{\|\phi_m\|_2 = 1} \phi_m^\top \hat{\Sigma} \phi_m$ with $\hat{\Sigma} = \frac{1}{n} X^\top X$ (for centered $X$). Moreover, $\phi_m$ is the $m$-th eigenvector (by decreasing eigenvalue) of $\hat{\Sigma}$ (a short derivation follows this list).
- Each $Z_m$ is uncorrelated with $Z_{m'}$ for all $m' < m$.
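As noted above, the eigenvector characterization of the loading vector follows from a Lagrange-multiplier argument (shown here for the first PC; later PCs add orthogonality constraints to the earlier loadings):

$$
\max_{\|\phi\|_2 = 1} \phi^\top \hat{\Sigma} \phi
\quad\Longrightarrow\quad
\nabla_\phi \big( \phi^\top \hat{\Sigma} \phi - \lambda (\phi^\top \phi - 1) \big) = 2 \hat{\Sigma} \phi - 2 \lambda \phi = 0
\quad\Longrightarrow\quad
\hat{\Sigma} \phi = \lambda \phi .
$$

Thus $\phi$ is an eigenvector of $\hat{\Sigma}$, and the variance attained is $\phi^\top \hat{\Sigma} \phi = \lambda$, so the first loading vector is the eigenvector with the largest eigenvalue.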
Given a data matrix $X$ with $n$ observations and $p$ features, we want to construct $K$ PCs with $K \le p$; we can use the following steps (a code sketch follows the list):
- Center $X$ so that the columns have zero mean: $\tilde{X} = X - \mathbf{1}_n \bar{x}^\top$, where $\bar{x}^\top = \frac{1}{n} \mathbf{1}_n^\top X$ holds the column means and $\mathbf{1}_n$ is the $n \times 1$ matrix of ones, so that $\mathbf{1}_n \bar{x}^\top$ is an $n \times p$ matrix.
- Compute the first $K$ loading vectors $\phi_1, \dots, \phi_K$ from the centered data $\tilde{X}$ (the top $K$ eigenvectors of $\tilde{X}^\top \tilde{X}$).
- Obtain the first $K$ PCs: $Z = \tilde{X} \Phi_K$, where $\Phi_K = [\phi_1, \dots, \phi_K]$ is $p \times K$.
- Add the centers back when reconstructing the data from the PCs: $\hat{X} = Z \Phi_K^\top + \mathbf{1}_n \bar{x}^\top$.
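A minimal sketch of these steps in NumPy, using an eigendecomposition of $\tilde{X}^\top \tilde{X}$; the function name `pca` is an illustrative assumption:

```python
import numpy as np

def pca(X, K):
    """PCA via eigendecomposition of the centered Gram matrix X^T X.

    X: (n, p) data matrix; K: number of components (K <= p).
    Returns (Z, loadings, x_bar): scores (n, K), loadings (p, K), column means (p,).
    """
    # Step 1: center the columns.
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    # Step 2: loadings are the top-K eigenvectors (symmetric matrix -> eigh).
    evals, evecs = np.linalg.eigh(Xc.T @ Xc)
    order = np.argsort(evals)[::-1]      # sort by decreasing eigenvalue
    loadings = evecs[:, order[:K]]       # (p, K)
    # Step 3: project onto the loadings to get the PCs (scores).
    Z = Xc @ loadings                    # (n, K)
    return Z, loadings, x_bar

# Usage: step 4, reconstruct the data and add the centers back.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z, V, x_bar = pca(X, K=2)
X_hat = Z @ V.T + x_bar                  # rank-2 approximation of X
```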